Abstract 

This paper studies Markov Decision Processes under parameter uncertainty. We adopt the distributionally 
robust optimization framework: we assume that the uncertain parameters are random variables following an 
unknown distribution, and seek the strategy that maximizes the expected performance under the most adversarial 
distribution. In particular, we generalize the previous study [1], which concentrates on distribution sets with a very special 
structure, to a much more generic class of distribution sets, and show that the optimal strategy can be obtained 
efficiently under mild technical conditions. This significantly extends the applicability of distributionally robust 
MDPs by incorporating probabilistic information about the uncertainty in a more flexible way. 

Index Terms 

Markov processes, robust control, parameter uncertainty, distributional robustness. 


I. Introduction 


Markov Decision Processes (MDPs) are widely used tools to model stochastic sequential decision making 
problems (e.g., Q-0). A strategy that achieves the maximal expected accumulated reward is considered 
optimal. However, in practice, the transition probabilities and reward parameters are typically estimated 
from finite and possibly noisy data, and the estimates often deviate from the true values. Such deviation, called 
“parameter uncertainty”, can cause the performance of the nominally optimal policies to degrade significantly (see 
experiments in Q). 

Inspired by the “robust optimization” framework in mathematical programming (e.g., @-0), many 
efforts have been made to alleviate the effect of parameter uncertainty in MDPs (e.g., ||T|, p^-p3]|). 
Most previous studies (e.g., [10], [11], pA]|-p^) focus on the “robust MDP”, which treats the uncertain 
parameter as a fixed yet unknown element of a given “uncertainty set”, and aims to find the strategy that 
achieves the best performance under the worst parameter. This set-inclusive formulation of uncertainty can 
be conservative, as it cannot incorporate probabilistic information about the uncertainty that is often available 
in practice (e.g., p^ , (T^). To overcome this, [1] proposed the distributionally robust MDP approach, 
which can incorporate certain kinds of probabilistic information about the uncertainty. More specifically, this 
approach treats the uncertain parameters as random variables following an unknown distribution that 
is known to belong to a set of distributions, called the “ambiguity set”. The goal is to 
seek a strategy that achieves the maximum expected performance under the most adversarial distribution 
of the uncertain parameters. Indeed, this approach is the multi-stage counterpart of 
distributionally robust optimization (e.g., [1], p^ ), which considers the following: given a utility function $u(x, \xi)$, where 
$x \in \mathcal{X}$ is the optimizing variable and $\xi$ is the unknown parameter, distributionally robust optimization 
solves $\max_{x \in \mathcal{X}} \big[\inf_{\mu \in \mathcal{C}} \mathbb{E}_{\xi \sim \mu}\, u(x, \xi)\big]$, where $\mathcal{C}$ is an a priori known set of distributions. 
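
The max-inf structure of distributionally robust optimization can be illustrated on a toy problem. The sketch below is only an illustration under simplifying assumptions that are not part of the paper: the ambiguity set is a finite list of discrete distributions, the decision is picked from a finite grid, and the names `worst_case_expectation` and `distributionally_robust_choice` are hypothetical.

```python
def worst_case_expectation(x, distributions, utility):
    """Return inf over the candidate distributions of E[u(x, xi)].

    Each distribution is a list of (probability, xi) pairs.
    """
    return min(
        sum(prob * utility(x, xi) for prob, xi in dist)
        for dist in distributions
    )


def distributionally_robust_choice(candidates, distributions, utility):
    """Return the candidate decision with the best worst-case expectation."""
    return max(
        candidates,
        key=lambda x: worst_case_expectation(x, distributions, utility),
    )


def utility(x, xi):
    # Toy utility: negative squared distance to the uncertain parameter.
    return -(x - xi) ** 2


# Hypothetical ambiguity set made of three discrete distributions of xi.
ambiguity_set = [
    [(0.5, 0.0), (0.5, 2.0)],
    [(1.0, 1.0)],
    [(0.8, 0.0), (0.2, 2.0)],
]
candidates = [i / 10 for i in range(21)]  # grid on [0, 2]

best_x = distributionally_robust_choice(candidates, ambiguity_set, utility)
best_value = worst_case_expectation(best_x, ambiguity_set, utility)
```

In this toy instance the first and third distributions are both binding at the selected decision, which is typical of max-inf problems: the robust decision balances several adversarial distributions rather than optimizing against a single one.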

We highlight our contributions by comparing with [1]. In [1], the state-wise ambiguity set is restricted 
to the following form: $\mathcal{C}_s = \{\mu_s \mid \mu_s(O_s^i) \ge \alpha_s^i,\ \forall i = 1, \dots, n_s\}$, where $\alpha_s^i \le \alpha_s^j$ and $O_s^i$ is a proper set 
of uncertain parameters with a “nested-set” structure, i.e., satisfying $O_s^i \subseteq O_s^j$ for all $i < j$ (see Fig. 
1a). This setup can effectively model distributions with a single mode (such as a Gaussian distribution), 
but is less suitable for multi-modal distributions such as a Gaussian mixture. Moreover, 
other probabilistic information such as the mean or the variance cannot be incorporated. Thus, in this paper, we 
extend the distributionally robust MDP approach to handle ambiguity sets with more general structures. 
In particular, we consider a class of ambiguity sets first proposed in [18] as a unifying framework 
for modeling and solving distributionally robust single-stage optimization problems, and embed them 
into the distributionally robust MDP setup. These ambiguity sets are considerably more general: they are 
characterized by a class of confidence sets $O_s^i$ which can be either nested or disjoint (as shown in Fig. 1b), and moreover, 
additional linear constraints are allowed to define the ambiguity set, which can be used to incorporate 
probabilistic information such as the mean, the covariance or other variation measures. We show that, under this 
more general class of ambiguity sets, the resulting distributionally robust MDPs remain tractable under 
mild technical conditions, and often outperform previous methods, partly because they can model 
uncertainty in a more flexible way. 

The rest of the paper is organized as follows. Section II provides some background on uncertain MDPs 
and presents our problem setup and the necessary assumptions. Finite horizon and discounted-reward infinite 
horizon distributionally ambiguous MDPs are discussed in Section III and Section IV, respectively. We are 
particularly interested in the solution approach, i.e., how to solve the distributionally ambiguous MDPs. 
In Section V we present simulation results on a machine replacement problem and a path planning problem, both with 
parameter uncertainty, which show that the proposed approach outperforms the nominal approach, 
the robust approach, and the distributionally robust approach proposed in [1]. We conclude the paper in Section 
VI. As this work generalizes [1], we reuse some of its results, and indicate this in the corresponding sections. 


II. Preliminaries 

Throughout the paper, we use capital letters to denote matrices, and bold face letters to denote column 
vectors. Row vectors are represented as the transpose of column vectors. We use $\mathbf{1}$ to denote the indicator 
function, and use $e_i(m)$ to denote the $i$-th elementary vector of length $m$. $\mathbb{R}^n_+$ denotes the set of nonnegative 
real $n$-vectors, and $\mathrm{tr}(A)$ denotes the trace of a square matrix $A$. If $\mathcal{C}$ is a set of joint probability distributions 
of three random vectors $a$, $b$ and $c$, then $\Pi_{(a,b)}\mathcal{C}$ denotes the set of marginal distributions of $(a, b)$. We 
use $\oplus$ to represent a mixture distribution: given two probability distributions $\mathcal{F}_1$, $\mathcal{F}_2$ and a Bernoulli random 
variable $x$ which takes value 1 w.p. $p$, $x\mathcal{F}_1 \oplus (1 - x)\mathcal{F}_2$ is a random variable that follows distribution 
$\mathcal{F}_1$ w.p. $p$, and follows $\mathcal{F}_2$ w.p. $1 - p$. We use $\mathcal{N}(m, \sigma^2)$ to represent a Gaussian distribution with mean 
$m$ and variance $\sigma^2$. 

A (finite) Markov Decision Process (MDP) is defined as a 6-tuple $(T, \gamma, S, A, p, r)$. Here, $T$ is the 
possibly infinite decision horizon; $\gamma \in (0, 1]$ is the discount factor; $S$ is the state set and $A_s$ is the 
action set of state $s \in S$, both assumed to be finite. The parameters $p$ and $r$ are the transition probability 
and the expected reward, respectively. That is, for $s \in S$ and $a \in A_s$, $r(s, a)$ is the expected reward 
and $p(s'|s, a)$ is the probability that the next state is $s'$. Following Q, we denote the set of all history-dependent 
randomized strategies by $\Pi^{HR}$ and the set of all Markovian randomized strategies by $\Pi^{MR}$. 

We use the subscript $s$ to denote the value associated with the state $s$, e.g., $r_s$ denotes the vector form of 
the rewards associated with the state $s$, and $\pi_s$ is the (randomized) action chosen at state $s$ for strategy 
$\pi$. The elements in the vector $p_s$ are listed in the following way: the transition probabilities of the same 
action are arranged in the same block, and inside each block they are listed according to the order of the 
next state. We use $\underline{s}$ to denote the (random) state following $s$, and $\Delta(s)$ to denote the probability simplex 
on $A_s$. We use $\bigotimes$ to represent the Cartesian product, e.g., $p = \bigotimes_{s \in S} p_s$. 

For a given strategy $\pi \in \Pi^{HR}$, we denote the expected (discounted) total reward under the parameter pair 
$(p, r)$ as 

$$u(\pi, p, r) \triangleq \mathbb{E}^{\pi}_{p}\Big\{ \sum_{i=1}^{T} \gamma^{i-1} r(s_i, a_i) \Big\}.$$
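
For fixed parameters, the expected total reward can be evaluated by a simple backward recursion. The sketch below is an illustration only and makes assumptions beyond the text: the horizon is finite, the strategy is Markovian and stationary (one action distribution per state), states are indexed by integers, and the function name `expected_total_reward` is hypothetical.

```python
def expected_total_reward(policy, p, r, gamma, horizon, start):
    """Evaluate u(pi, p, r) for a stationary Markovian randomized policy.

    policy[s][a]  : probability of choosing action a in state s
    p[s][a][s2]   : probability of moving from s to s2 under action a
    r[s][a]       : expected reward of action a in state s
    The terminal reward is zero, and `horizon` decisions are made.
    """
    n_states = len(r)
    value = [0.0] * n_states  # value-to-go after the last decision
    for _ in range(horizon):
        value = [
            sum(
                policy[s][a]
                * (r[s][a] + gamma * sum(p[s][a][s2] * value[s2]
                                         for s2 in range(n_states)))
                for a in range(len(r[s]))
            )
            for s in range(n_states)
        ]
    return value[start]


# Toy MDP: state 0 pays reward 1 and loops; state 1 pays nothing and loops.
p = [[[1.0, 0.0]], [[0.0, 1.0]]]
r = [[1.0], [0.0]]
policy = [[1.0], [1.0]]
u_from_0 = expected_total_reward(policy, p, r, gamma=0.5, horizon=3, start=0)
u_from_1 = expected_total_reward(policy, p, r, gamma=0.5, horizon=3, start=1)
```

With three decisions and discount 0.5, starting in state 0 collects 1 + 0.5 + 0.25, matching the sum in the definition.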

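A Distributionally Ambiguous MDP (DAMDP) is defined as a tuple $(T, \gamma, S, A, \mathcal{C}_S)$, where the transition 
probability $p$ and the expected reward $r$ are unknown. Instead, they are assumed to obey a joint distribution 
$\mu$ (also unknown) that belongs to a known ambiguity set $\mathcal{C}_S = \Pi_{(p,r)} \tilde{\mathcal{C}}_S$, where $\Pi_{(p,r)}$ means taking the 
marginal distribution of $(p, r)$, and $\tilde{\mathcal{C}}_S$ is a set of joint distributions of $(p, r)$ and an auxiliary random vector introduced below. 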

While the DAMDP framework can be very general, most choices of $\mathcal{C}_S$ result in computationally 
intractable formulations (e.g., Q). Hence, our first requirement on $\mathcal{C}_S$ is that the parameters of different states 
are independent. That is, 

$$\mathcal{C}_S = \Big\{ \mu \;\Big|\; \mu = \bigotimes_{s \in S} \mu_s,\ \mu_s \in \mathcal{C}_s,\ \forall s \in S \Big\},$$

where the “state-wise ambiguity set” $\mathcal{C}_s$ is a set of distributions of the parameters of state $s$. By the definition of 
$\mathcal{C}_S$, the state-wise property applies to $\tilde{\mathcal{C}}_S$ as well. This property is the same as the concept of “s-rectangularity” 
in p^ , and is essential for reducing the DAMDP to a robust MDP in Lemma 1. In addition, [10] showed that the 
robust MDP with coupled uncertainty sets is computationally challenging, which implies that solving a DAMDP 
with nonrectangular ambiguity sets is even harder. 

We now discuss the admissible state-wise ambiguity set. Our formulation of the state-wise ambiguity 
set follows the unifying framework of [18]. Specifically, given $s \in S$, the state-wise ambiguity set is 
representable in the following standard form: 

$$\tilde{\mathcal{C}}_s = \Big\{ \mu_s \;:\; \mathbb{E}_{(p_s, r_s, u_s) \sim \mu_s}\big[F_s p_s + G_s r_s + H_s u_s\big] = c_s,\quad \mu_s\big((p_s, r_s, u_s) \in O_s^i\big) \in [\underline{\alpha}_s^i, \overline{\alpha}_s^i],\ \forall i \in I_s \Big\}. \qquad (1)$$

Here, $F_s$, $G_s$, $H_s$ are real matrices and $c_s$ is a real vector of compatible dimensions; $I_s = \{1, 2, \dots, n_s\}$ is an index set; 
$O_s^i$ is a set of possible values of the parameters $(p_s, r_s, u_s)$, termed “confidence set”; and 
$\underline{\alpha}_s^i, \overline{\alpha}_s^i \in [0, 1]$, with $\underline{\alpha}_s^i \le \overline{\alpha}_s^i$ for all $i \in I_s$, are the lower bound and upper bound of the probability 
that the parameters belong to the confidence set. Thus, each confidence set $O_s^i$ provides an estimate 
of the uncertain parameters $(p_s, r_s, u_s)$ subject to a different confidence level. The ambiguity sets $\tilde{\mathcal{C}}_s$ 
contain prescribed conic representable confidence sets and mean values residing on an affine manifold, 
which turns out to be rich enough to encompass and extend several ambiguity sets considered in the recent 
literature. We use $\tilde{\mathcal{C}}_s$ to denote the set of joint distributions of $(p_s, r_s, u_s)$, and 
hence the set of joint distributions of $(p_s, r_s)$ is $\mathcal{C}_s = \Pi_{(p_s, r_s)} \tilde{\mathcal{C}}_s$. 

Notice that here we use a classical technique called “lifting”: we introduce an auxiliary random vector u, 
so that some non-linear relationship can be modeled linearly. For example, as we see below, a constraint on 
the variance can be modeled using this standard form, which is otherwise impossible if auxiliary variable 
is not introduced. This lifting technique thus allows us to model a rich variety of structural information 
about the marginal distribution of (p, r) in a unified manner. 

Note that, when $n_s = 1$ and the distribution set involves only a probability bound of the form 
$\tilde{\mathcal{C}}_s = \{\mu_s(p_s, r_s, u_s) \mid \mu_s(O_s^i) = 1,\ \forall i \in I_s\}$ for all $s \in S$ (i.e., the distribution set only specifies the support of the 
random variables), the DAMDP reduces to the classical robust MDP formulation ([10], [11]), where the 
a priori information about the unknown parameters is that they belong to an uncertainty set. 

Assumptions 1 to 3 are standard requirements for the confidence sets, proposed in [18]. The first one 
specifies the relationship between different confidence sets. 

Assumption 1 (Nesting condition): For any $s \in S$ and all $i, i' \in I_s$ with $i \neq i'$, we have either $O_s^i \Subset O_s^{i'}$, 
$O_s^{i'} \Subset O_s^i$, or $O_s^i \cap O_s^{i'} = \emptyset$. 

Here $O_s^i \Subset O_s^{i'}$ means that the set $O_s^i$ is strictly included in the set $O_s^{i'}$, i.e., $O_s^i$ is contained in the interior 
of $O_s^{i'}$. The nesting condition is illustrated in Fig. 1b. The nesting condition implies a strict partial order 
on the confidence sets $O_s^i$ with respect to the $\Subset$-relation, with the additional requirement that incomparable 
sets must be disjoint. The nesting condition is a fundamental assumption needed for the tractability of 
distributionally robust optimization problems (cf. |j^|). We remark here that when $\tilde{\mathcal{C}}_s$ is of form (1), it 
trivially satisfies the nesting condition. The nesting condition can be verified analytically. Interested readers 
may refer to [18, Section 3]. 
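
For intuition, the nesting condition is easy to check when the confidence sets are one-dimensional intervals. The sketch below assumes closed intervals given as `(low, high)` pairs, which is an illustrative simplification (the paper works with conic representable sets); the helper names are hypothetical.

```python
def strictly_inside(inner, outer):
    """True if the closed interval `inner` lies in the interior of `outer`."""
    return outer[0] < inner[0] and inner[1] < outer[1]


def disjoint(a, b):
    """True if the closed intervals a and b do not intersect."""
    return a[1] < b[0] or b[1] < a[0]


def satisfies_nesting(intervals):
    """Check that every pair is either strictly nested or disjoint."""
    for i in range(len(intervals)):
        for j in range(i + 1, len(intervals)):
            a, b = intervals[i], intervals[j]
            if not (strictly_inside(a, b) or strictly_inside(b, a)
                    or disjoint(a, b)):
                return False
    return True


# Two disjoint sets inside a larger support: allowed.
nested_and_disjoint = [(0.0, 10.0), (1.0, 3.0), (5.0, 7.0)]
# Partially overlapping sets: not allowed.
overlapping = [(0.0, 5.0), (3.0, 8.0)]
# Nested but sharing a boundary point: inclusion is not strict.
touching_boundary = [(0.0, 5.0), (0.0, 3.0)]
```

The first family mirrors the disjoint-within-support pattern of Fig. 1b; the last one fails because strict inclusion requires the inner set to avoid the boundary of the outer set.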
Fig. 1. Illustration of the confidence sets. 


We highlight the contributions of the paper by comparing with [1], which also proposes a distributionally 
robust MDP approach. In [1], the state-wise ambiguity set is restricted to the following form: 

$$\mathcal{C}_s = \{\mu_s \mid \mu_s(O_s^i) \ge \alpha_s^i,\ \forall i = 1, \dots, n_s\},$$

where $\alpha_s^i \le \alpha_s^j$ and $O_s^i \subseteq O_s^j$, $\forall i < j$. 

Observe that our formulation is more general, as we allow additional linear constraints (in terms of 
expectation). This allows us to apply more statistical methods to estimate the uncertain parameters in 
MDPs. Moreover, the formulation of [1] requires a more restrictive nesting condition where the confidence 
sets must have an incremental structure $O_s^1 \subseteq O_s^2 \subseteq \cdots \subseteq O_s^{n_s}$, which is illustrated in Fig. 1a. In contrast, as 
shown in Fig. 1b, the confidence sets we use can be disjoint and thus provide significantly more powerful 
modeling ability. This new structure is more flexible, and can model more general distributions. 
For example, it can better characterize multimodal distributions such as Gaussian mixtures: we 
may establish confidence sets centered around the peaks of the different Gaussian components, as 
long as they do not intersect with each other. 

Next, for any $s \in S$ we require that $\tilde{\mathcal{C}}_s$ satisfies the following regularity conditions. 

Assumption 2 (Regularity conditions for $\tilde{\mathcal{C}}_s$): 1) The confidence set $O_s^{n_s}$ is bounded and has probability 
one, that is, $\underline{\alpha}_s^{n_s} = \overline{\alpha}_s^{n_s} = 1$. 

2) There is a probability distribution $\mu_s(p_s, r_s, u_s) \in \tilde{\mathcal{C}}_s$ such that $\mu_s(O_s^i) \in (\underline{\alpha}_s^i, \overline{\alpha}_s^i)$ whenever 
$\underline{\alpha}_s^i < \overline{\alpha}_s^i$, $i \in I_s$. 

Condition 1 of Assumption 2 ensures that the confidence set with the largest index, $O_s^{n_s}$, contains the 
support of the joint unknown parameters $(p_s, r_s, u_s)$. The second condition stipulates that there is a 
probability distribution $\mu_s(p_s, r_s, u_s) \in \tilde{\mathcal{C}}_s$ that satisfies the probability bounds in (1) as strict inequalities 
whenever the corresponding probability interval is non-degenerate. This assumption allows 
strong duality to hold for the distributionally robust counterpart, which we will define later. 

For each individual $O_s^i$, we make the following assumption for tractability. 

Assumption 3: For any $s \in S$ and $i \in I_s$, $O_s^i$ is nonempty and convex. Each confidence set $O_s^i$ is 
defined as 

$$O_s^i = \Big\{ (p_s, r_s, u_s) \;:\; B_s^i p_s + D_s^i r_s + E_s^i u_s \preceq_{K_s^i} b_s^i \Big\},$$

where $B_s^i$, $D_s^i$, $E_s^i$ are real matrices and $b_s^i$ is a real vector of compatible dimensions, and $K_s^i$ are proper cones (i.e., closed, 
convex and pointed cones with nonempty interior). 

Similarly to [10], we define the non-stationary model as follows: when a state is visited multiple times, each 
time it can take a different parameter realization. We define the stationary model as follows: the uncertain parameters 
are chosen by nature, depending on the decision maker’s strategy, once and for all, and remain fixed thereafter. 
Each model leads to a game between the decision maker and nature, where the decision maker seeks to 
maximize the minimum expected reward, with nature being the minimizing player. The second model 
is attractive for statistical reasons. Unfortunately, it turns out to be hard to solve. Therefore, in this paper, 
for the finite horizon case we adopt the property of the non-stationary model: multiple visits to 
a state are treated as visits to different states. This can easily be done by introducing dummy states. 

For the finite horizon case, as in [1], we make the following assumption, which will simplify our derivations. 

Assumption 4: 1) Each state belongs to only one stage. 

2) The terminal reward equals zero. 

3) The first stage only contains one state $s^{ini}$. 

Using condition 1 of Assumption 4, we partition $S$ according to the stage each state belongs to. 
That is, we let $S_t$ be the set of states that belong to the $t$-th stage. 

III. Finite horizon distributionally robust MDPs 

This section focuses on uncertain MDPs with a finite number of decision stages. We show that a strategy 
defined through backward induction, which we call the S-robust strategy, is a distributionally robust strategy. 
We further show that such a strategy can be computed in polynomial time under mild technical conditions. This 
generalizes similar results in [1] to a significantly more general class of ambiguity sets. 

For $\pi \in \Pi^{HR}$ and $\mu \in \mathcal{C}_S$, we denote the expected performance of a DAMDP as 

$$w(\pi, \mu, (s^{ini})) \triangleq \mathbb{E}_{(p,r) \sim \mu}\{u(\pi, p, r)\} = \int u(\pi, p, r)\, d\mu(p, r).$$

Definition 1: A strategy $\pi^* \in \Pi^{HR}$ is distributionally robust with respect to $\mathcal{C}_S$ if it satisfies that for 
all $\pi \in \Pi^{HR}$, 

$$\inf_{\mu \in \mathcal{C}_S} w(\pi, \mu, (s^{ini})) \le \inf_{\mu' \in \mathcal{C}_S} w(\pi^*, \mu', (s^{ini})).$$

In words, each strategy is evaluated by its expected performance under the (respective) most adversarial 
distribution of the uncertain parameters, and a distributionally robust strategy is the optimal strategy 
according to this measure. 

The main focus of this section is deriving approaches to compute the distributionally robust strategy. To 
this end, we need the following definition. 

Definition 2: Given a DAMDP $(T, \gamma, S, A, \mathcal{C}_S)$ with $T < \infty$, we define the S-robust problem as 
follows. 

1) For $s \in S_T$, the S-robust value is $v_T(s) = 0$. 

2) For $s \in S_t$, where $t < T$, the S-robust value $v_t(s)$ and the S-robust action $\pi_s^*$ are defined as 

$$v_t(s) \triangleq \max_{\pi_s \in \Delta(s)} \Big\{ \min_{\mu_s \in \mathcal{C}_s} \mathbb{E}_{(p_s, r_s) \sim \mu_s}\big\{ \mathbb{E}^{\pi_s}_{p_s}[r(s, a) + \gamma v_{t+1}(\underline{s})] \big\} \Big\},$$

$$\pi_s^* \in \arg\max_{\pi_s \in \Delta(s)} \Big\{ \min_{\mu_s \in \mathcal{C}_s} \mathbb{E}_{(p_s, r_s) \sim \mu_s}\big\{ \mathbb{E}^{\pi_s}_{p_s}[r(s, a) + \gamma v_{t+1}(\underline{s})] \big\} \Big\}.$$

3) A strategy $\pi^*$ is an S-robust strategy if, for all $s \in S$ and every history $h$ that ends at $s$, $\pi^*$ conditioned 
on history $h$ is an S-robust action. 

Note that the definition essentially requires that the strategy be robust with respect to each sub-problem, 
hence the name “S-robust”. The following theorem shows that any S-robust strategy $\pi^*$ is 
distributionally robust, and is the main result of this paper. 

Theorem 1: Let $T < \infty$. Under Assumptions and if $\pi^*$ is an S-robust strategy, then 

1) $\pi^*$ is a distributionally robust strategy with respect to $\mathcal{C}_S$. 

2) There exists $\mu^* \in \mathcal{C}_S$ such that $(\pi^*, \mu^*)$ is a saddle point. That is, 

$$\sup_{\pi \in \Pi^{HR}} w(\pi, \mu^*, (s^{ini})) = w(\pi^*, \mu^*, (s^{ini})) = \inf_{\mu \in \mathcal{C}_S} w(\pi^*, \mu, (s^{ini})).$$
Proof: The proof follows a structure similar to that of Theorem 3.1 in [1]. We first state a lemma 
from [1] which shows that, for a given strategy, the expected performance under an admissible $\mu$ depends 
only on the expected value of the parameters. Thus we are able to reduce the distributionally robust MDP 
to the classical robust MDP. Then we show that the set of expected values of the uncertain parameters is 
convex and compact. Finally, by applying the results for classical robust MDPs we prove the theorem. 

The following lemma is Lemma 3.2 of [1]. Hence we state the result without proof. Interested 
readers may refer to [1] for the proof. 

Lemma 1: Under the state-wise property of the ambiguity set $\mathcal{C}_S$, fix $\pi \in \Pi^{HR}$ and $\mu \in \mathcal{C}_S$, and denote 
$\bar{p} = \mathbb{E}_\mu(p)$ and $\bar{r} = \mathbb{E}_\mu(r)$. We have 

$$w(\pi, \mu, (s^{ini})) = u(\pi, \bar{p}, \bar{r}).$$

Lemma 1 essentially means that, for any strategy, the expected performance under an admissible 
distribution $\mu$ only depends on the expected value of the parameters under $\mu$. Thus, the distributionally 
robust MDP reduces to the robust MDP. Next we characterize the set of expected values of the parameters. 

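Lemma 2: For $s \in S$ and $\pi_s \in \Delta(s)$, we define the set 

$$Z_s = \{\mathbb{E}_{\mu_s}(p_s, r_s) \mid \mu_s \in \mathcal{C}_s\}.$$

Then the set $Z_s$ is convex and compact. 

Proof: First, we show that, for $s \in S$ and $\pi_s \in \Delta(s)$, the set defined as $\tilde{Z}_s = \{\mathbb{E}_{\mu_s}(p_s, r_s, u_s) \mid \mu_s \in \tilde{\mathcal{C}}_s\}$ 
is convex and compact. 

Fix $s \in S$, two distributions $\mu_s^1, \mu_s^2 \in \tilde{\mathcal{C}}_s$, and $\lambda \in [0, 1]$. We have 

$$\lambda \mathbb{E}_{\mu_s^1}(p_s, r_s, u_s) + (1 - \lambda) \mathbb{E}_{\mu_s^2}(p_s, r_s, u_s) = \mathbb{E}_{\lambda \mu_s^1 + (1 - \lambda) \mu_s^2}(p_s, r_s, u_s),$$

which holds due to the linearity of the expectation operation. Next, we show that $\lambda \mu_s^1 + (1 - \lambda) \mu_s^2 \in \tilde{\mathcal{C}}_s$. Since 
$\mu_s^1, \mu_s^2 \in \tilde{\mathcal{C}}_s$, for $s \in S$ we have 

$$\lambda \int_{O_s^{n_s}} (F_s p_s + G_s r_s + H_s u_s)\, d\mu_s^1 + (1 - \lambda) \int_{O_s^{n_s}} (F_s p_s + G_s r_s + H_s u_s)\, d\mu_s^2 = \lambda c_s + (1 - \lambda) c_s = c_s,$$

$$\lambda \int_{O_s^{n_s}} \mathbf{1}_{[(p_s, r_s, u_s) \in O_s^i]}\, d\mu_s^1 + (1 - \lambda) \int_{O_s^{n_s}} \mathbf{1}_{[(p_s, r_s, u_s) \in O_s^i]}\, d\mu_s^2 \ge \lambda \underline{\alpha}_s^i + (1 - \lambda) \underline{\alpha}_s^i = \underline{\alpha}_s^i, \quad \forall i \in I_s,$$

$$\lambda \int_{O_s^{n_s}} \mathbf{1}_{[(p_s, r_s, u_s) \in O_s^i]}\, d\mu_s^1 + (1 - \lambda) \int_{O_s^{n_s}} \mathbf{1}_{[(p_s, r_s, u_s) \in O_s^i]}\, d\mu_s^2 \le \lambda \overline{\alpha}_s^i + (1 - \lambda) \overline{\alpha}_s^i = \overline{\alpha}_s^i, \quad \forall i \in I_s.$$

Hence the convexity follows. To show compactness, notice that $\tilde{\mathcal{C}}_s$ is weakly closed (i.e., closed 
w.r.t. the weak topology), since the feasible set of each constraint is weakly closed and so is 
their intersection. Thus, $\tilde{Z}_s$ is closed since it is the image of $\tilde{\mathcal{C}}_s$ under expectation (which is a continuous 
function). This implies that $\tilde{Z}_s$ is compact, since $O_s^{n_s}$ is bounded and hence $\tilde{Z}_s$ is bounded. 

Finally, since $Z_s$ is the projection of the set $\tilde{Z}_s$ onto its first two coordinates, its convexity and compactness 
follow immediately. ■ 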

Lemma 2 implies that, for $s \in S$ and $\pi_s \in \Delta(s)$, there exists $(p_s^*, r_s^*) \in Z_s$ that satisfies 

$$\inf_{(p_s, r_s) \in Z_s} u(\pi_s, p_s, r_s) = u(\pi_s, p_s^*, r_s^*).$$

Furthermore, we can construct $\bigotimes_{s \in S} Z_s = \{\mathbb{E}_\mu(p, r) \mid \mu \in \mathcal{C}_S\}$ by the state-wise decomposability of $\mathcal{C}_S$. 

We complete the proof of Theorem 1 by using the equivalence of distributionally robust MDPs and 
robust MDPs where the uncertainty set is $\bigotimes_{s \in S} Z_s$. It is well known that for robust MDPs, a saddle point 
of the minimax objective exists ([11]). More precisely, there exist $\pi^* \in \Pi^{HR}$ and $(p^*, r^*) \in \bigotimes_{s \in S} Z_s$ 
such that 

$$\inf_{(p, r) \in \bigotimes_{s \in S} Z_s} u(\pi^*, p, r) = u(\pi^*, p^*, r^*) = \sup_{\pi \in \Pi^{HR}} u(\pi, p^*, r^*)$$

holds. Moreover, we can construct $\pi^*$ and $(p^*, r^*)$ state-wise as $\pi^* = \bigotimes_{s \in S} \pi_s^*$ and $(p^*, r^*) = \bigotimes_{s \in S} (p_s^*, r_s^*)$. 
For each $s \in S_t$, $\pi_s^*$ and $(p_s^*, r_s^*)$ solve the zero-sum game 

$$\max_{\pi_s \in \Delta(s)} \min_{(p_s, r_s) \in Z_s} \mathbb{E}^{\pi_s}_{p_s}[r(s, a) + \gamma v_{t+1}(\underline{s})].$$

Thus $\pi_s^*$ is any S-robust action, and hence $\pi^*$ can be any S-robust strategy. From Lemma 2, there exists 
$\mu_s^*(p_s, r_s) \in \mathcal{C}_s$ satisfying $\mathbb{E}_{\mu_s^*}(p_s, r_s) = (p_s^*, r_s^*)$. Let $\mu^* = \bigotimes_{s \in S} \mu_s^*$. Applying Lemma 1, we have 

$$\sup_{\pi \in \Pi^{HR}} w(\pi, \mu^*, (s^{ini})) = \sup_{\pi \in \Pi^{HR}} u(\pi, p^*, r^*),$$

$$w(\pi^*, \mu^*, (s^{ini})) = u(\pi^*, p^*, r^*),$$

$$\inf_{\mu \in \mathcal{C}_S} w(\pi^*, \mu, (s^{ini})) = \inf_{(p, r) \in \bigotimes_{s \in S} Z_s} u(\pi^*, p, r).$$

The above leads to 

$$\sup_{\pi \in \Pi^{HR}} w(\pi, \mu^*, (s^{ini})) = w(\pi^*, \mu^*, (s^{ini})) = \inf_{\mu \in \mathcal{C}_S} w(\pi^*, \mu, (s^{ini})).$$

Thus, part (2) of Theorem 1 holds. Part (1) follows immediately. ■ 

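Define $\mathbf{v}_{t+1}$ as the vector form of $v_{t+1}(s')$ for all $s' \in S_{t+1}$, and 

$$V_s \triangleq \begin{bmatrix} \mathbf{v}_{t+1}\, e_1^\top(|A_s|) \\ \vdots \\ \mathbf{v}_{t+1}\, e_{|A_s|}^\top(|A_s|) \end{bmatrix}.$$

Therefore, the expected reward under a fixed parameter realization $(p_s, r_s)$ and any policy $\pi_s \in \Delta(s)$ is 

$$\mathbb{E}^{\pi_s}_{p_s}[r(s, a) + v_{t+1}(\underline{s})] = \sum_{a \in A_s} \pi_s(a) r(s, a) + \sum_{a \in A_s} \sum_{s' \in S_{t+1}} \pi_s(a) p(s'|s, a) v_{t+1}(s') = r_s^\top \pi_s + p_s^\top V_s \pi_s.$$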
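
The block structure of the matrix that stacks one copy of the next-stage value vector per action can be checked numerically. The sketch below is illustrative: it uses plain Python lists, orders the flattened transition vector by action blocks as described in Section II, and the helper names `build_V`, `bilinear_form` and `direct_sum` are hypothetical.

```python
def build_V(v_next, num_actions):
    """Stack the blocks v_next * e_a^T for a = 0, ..., num_actions - 1.

    The result has len(v_next) * num_actions rows and num_actions columns.
    """
    V = []
    for a in range(num_actions):
        for value in v_next:
            row = [0.0] * num_actions
            row[a] = value  # only column a of block a is non-zero
            V.append(row)
    return V


def bilinear_form(p_flat, V, pi):
    """Compute p^T V pi for a flattened transition vector p."""
    return sum(
        p_flat[i] * V[i][a] * pi[a]
        for i in range(len(V))
        for a in range(len(pi))
    )


def direct_sum(p_by_action, v_next, pi):
    """Compute sum_a pi(a) * sum_s' p(s'|s,a) * v(s') directly."""
    return sum(
        pi[a] * sum(q * v for q, v in zip(p_by_action[a], v_next))
        for a in range(len(pi))
    )


v_next = [1.0, 2.0, 3.0]
p_by_action = [[0.2, 0.3, 0.5], [0.6, 0.4, 0.0]]
p_flat = [q for block in p_by_action for q in block]  # action blocks in order
pi = [0.25, 0.75]

V = build_V(v_next, num_actions=2)
lhs = bilinear_form(p_flat, V, pi)
rhs = direct_sum(p_by_action, v_next, pi)
```

Both quantities agree, which is exactly why the expected continuation value is bilinear in the transition vector and the action distribution.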



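We now investigate the computational aspect of the S-robust action. 

Theorem 2: For $s \in S_t$ where $t < T$, the S-robust action is the optimal solution of the following 
optimization problem, with decision variables $w$, $\pi_s$, $\beta$, $\kappa$, $\lambda$ and $\rho_i$, $i \in I_s$: 

$$\begin{aligned}
\text{minimize}\quad & w \\
\text{subject to}\quad & c_s^\top \beta + \sum_{i \in I_s} \big[\overline{\alpha}_s^i \kappa_i - \underline{\alpha}_s^i \lambda_i\big] \le w, \\
& (b_s^i)^\top \rho_i \le \sum_{i' \in \mathcal{A}(i)} [\kappa_{i'} - \lambda_{i'}], \quad i \in I_s, \\
& (B_s^i)^\top \rho_i + V_s \pi_s + F_s^\top \beta = 0, \quad i \in I_s, \\
& (D_s^i)^\top \rho_i + \pi_s + G_s^\top \beta = 0, \quad i \in I_s, \\
& (E_s^i)^\top \rho_i + H_s^\top \beta = 0, \quad i \in I_s, \\
& \pi_s \in \Delta(s),\ \beta \in \mathbb{R}^{k},\ \kappa, \lambda \in \mathbb{R}^{n_s}_+,\ \rho_i \in (K_s^i)^*,\ i \in I_s,
\end{aligned}$$

where $k$ is the dimension of $c_s$. Here, $(K_s^i)^*$ represents the cone dual to $K_s^i$, and the set $\mathcal{A}(i) = \{i\} \cup \{i' \in I_s : O_s^i \Subset O_s^{i'}\}$. 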

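Proof: First, for $s \in S_t$, since the expectation term above is independent of the auxiliary random 
vector $u_s$, the S-robust problem can be written as 

$$\begin{aligned}
\underset{w,\, \pi_s}{\text{minimize}}\quad & w \\
\text{subject to}\quad & \max_{\mu_s \in \tilde{\mathcal{C}}_s} \mathbb{E}_{\mu_s}\big\{ -\mathbb{E}^{\pi_s}_{p_s}[r(s, a) + \gamma v_{t+1}(\underline{s})] \big\} \le w, \\
& \pi_s \in \Delta(s).
\end{aligned}$$

Now we consider the constraint, which we call the distributionally robust counterpart: 

$$\max_{\mu_s \in \tilde{\mathcal{C}}_s} \mathbb{E}_{\mu_s}\big\{ -\mathbb{E}^{\pi_s}_{p_s}[r(s, a) + \gamma v_{t+1}(\underline{s})] \big\} \le w.$$

The left-hand side of the distributionally robust counterpart can be equivalently expressed as the optimal value of 

$$\begin{aligned}
\underset{\mu_s}{\text{maximize}}\quad & -\int_{O_s^{n_s}} \mathbb{E}^{\pi_s}_{p_s}[r(s, a) + \gamma v_{t+1}(\underline{s})]\, d\mu_s(p_s, r_s, u_s) \\
\text{subject to}\quad & \int_{O_s^{n_s}} (F_s p_s + G_s r_s + H_s u_s)\, d\mu_s(p_s, r_s, u_s) = c_s, \\
& \int_{O_s^{n_s}} \mathbf{1}_{[(p_s, r_s, u_s) \in O_s^i]}\, d\mu_s(p_s, r_s, u_s) \ge \underline{\alpha}_s^i, \quad \forall i \in I_s, \\
& \int_{O_s^{n_s}} \mathbf{1}_{[(p_s, r_s, u_s) \in O_s^i]}\, d\mu_s(p_s, r_s, u_s) \le \overline{\alpha}_s^i, \quad \forall i \in I_s.
\end{aligned}$$

Theorem 2 of [18] states that strong duality holds for the above maximization problem under 
Assumptions 1 and 2. The dual problem, whose optimal value must not exceed $w$, is of the following form: 

$$\begin{aligned}
\underset{\beta,\, \kappa,\, \lambda}{\text{minimize}}\quad & c_s^\top \beta + \sum_{i \in I_s} \big[\overline{\alpha}_s^i \kappa_i - \underline{\alpha}_s^i \lambda_i\big] \\
\text{subject to}\quad & [F_s p_s + G_s r_s + H_s u_s]^\top \beta + \sum_{i' \in \mathcal{A}(i)} [\kappa_{i'} - \lambda_{i'}] \ge -\mathbb{E}^{\pi_s}_{p_s}[r(s, a) + \gamma v_{t+1}(\underline{s})], \\
& \qquad \forall (p_s, r_s, u_s) \in O_s^i,\ \forall i \in I_s, \\
& \beta \in \mathbb{R}^{k},\ \kappa, \lambda \in \mathbb{R}^{n_s}_+,
\end{aligned} \qquad (3)$$

where $\mathcal{A}(i) = \{i\} \cup \{i' \in I_s : O_s^i \Subset O_s^{i'}\}$. 

The semi-infinite constraint in the above problem holds if and only if, for each $i \in I_s$, the optimal value of the following problem is nonpositive: 

$$\begin{aligned}
\underset{p_s,\, r_s,\, u_s}{\text{maximize}}\quad & -\Big\{ [F_s p_s + G_s r_s + H_s u_s]^\top \beta + \sum_{i' \in \mathcal{A}(i)} [\kappa_{i'} - \lambda_{i'}] + \mathbb{E}^{\pi_s}_{p_s}[r(s, a) + \gamma v_{t+1}(\underline{s})] \Big\} \\
\text{subject to}\quad & B_s^i p_s + D_s^i r_s + E_s^i u_s \preceq_{K_s^i} b_s^i.
\end{aligned}$$

The above problem is a convex optimization problem. By applying the duality theory of convex 
optimization |j^, we get its dual below. For each $i \in I_s$, 

$$\begin{aligned}
\underset{\rho_i}{\text{minimize}}\quad & (b_s^i)^\top \rho_i - \sum_{i' \in \mathcal{A}(i)} [\kappa_{i'} - \lambda_{i'}] \\
\text{subject to}\quad & (B_s^i)^\top \rho_i + V_s \pi_s + F_s^\top \beta = 0, \\
& (D_s^i)^\top \rho_i + \pi_s + G_s^\top \beta = 0, \\
& (E_s^i)^\top \rho_i + H_s^\top \beta = 0, \\
& \rho_i \in (K_s^i)^*.
\end{aligned}$$

Thus, the semi-infinite constraint can be equivalently reformulated as follows: there exists $\rho_i \in (K_s^i)^*$ with 
$(B_s^i)^\top \rho_i + V_s \pi_s + F_s^\top \beta = 0$, $(D_s^i)^\top \rho_i + \pi_s + G_s^\top \beta = 0$ and $(E_s^i)^\top \rho_i + H_s^\top \beta = 0$ such that $(b_s^i)^\top \rho_i - \sum_{i' \in \mathcal{A}(i)} [\kappa_{i'} - \lambda_{i'}] \le 0$. 

Finally, by combining the constraints for all $i \in I_s$, we obtain the optimization formulation stated in the 
theorem. ■ 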

Thus, since $\Delta(s)$ is compact for $s \in S_t$, we can compute the S-robust action in polynomial time if all $K_s^i$ 
are “easy” cones such as linear, conic quadratic or semidefinite cones. Moreover, using Theorem 2 within 
backward induction, we can obtain the S-robust strategy efficiently. 
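
The backward induction itself can be sketched in a few lines once the per-state problem is simplified. The code below is not the conic program of Theorem 2: it assumes, purely for illustration, that the set of expected parameters of each state is the convex hull of finitely many scenarios (so the inner minimum is attained at a scenario) and that only deterministic actions are compared, which in general gives a lower bound on the S-robust value. The data layout and the name `robust_backward_induction` are hypothetical.

```python
def robust_backward_induction(stages, scenarios, gamma):
    """Max-min backward induction over finite scenario sets.

    stages       : list of lists of state names; the last stage is terminal
    scenarios[s] : list of dicts {"r": {a: reward}, "p": {a: {s2: prob}}},
                   each dict being one candidate mean parameter of state s
    Returns (values, actions) restricted to deterministic actions.
    """
    values = {s: 0.0 for s in stages[-1]}  # terminal reward is zero
    actions = {}
    for stage in reversed(stages[:-1]):
        for s in stage:
            best_action, best_value = None, float("-inf")
            for a in scenarios[s][0]["r"]:
                # Worst case over the candidate mean parameters of state s.
                worst = min(
                    sc["r"][a]
                    + gamma * sum(q * values[s2]
                                  for s2, q in sc["p"][a].items())
                    for sc in scenarios[s]
                )
                if worst > best_value:
                    best_action, best_value = a, worst
            values[s], actions[s] = best_value, best_action
    return values, actions


stages = [["s0"], ["s1", "s2"], ["end"]]
scenarios = {
    "s1": [{"r": {"go": 2.0}, "p": {"go": {"end": 1.0}}},
           {"r": {"go": 4.0}, "p": {"go": {"end": 1.0}}}],
    "s2": [{"r": {"go": 3.0}, "p": {"go": {"end": 1.0}}}],
    "s0": [{"r": {"left": 0.0, "right": 0.0},
            "p": {"left": {"s1": 1.0}, "right": {"s1": 0.5, "s2": 0.5}}},
           {"r": {"left": 0.0, "right": 0.0},
            "p": {"left": {"s1": 1.0}, "right": {"s2": 1.0}}}],
}
values, actions = robust_backward_induction(stages, scenarios, gamma=1.0)
```

Each state is visited once per pass, which mirrors the stage-wise structure guaranteed by Assumption 4; replacing the inner loop by the program of Theorem 2 would recover randomized S-robust actions.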

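By virtue of the lifting technique [18, Thm 5], we show below that several widely used ambiguity sets 
are indeed special cases of $\tilde{\mathcal{C}}_s$ defined in (1). We further list their corresponding S-robust problems. 

Example 1 (Mean): Assume that we only know a noisy empirical estimator of the exact mean of $p_s$. 
That is, given a matrix $G$ and a vector $f$ of compatible dimensions, and $p_s \sim \mu_s(p_s)$, we have $G\, \mathbb{E}_{\mu_s}[p_s] \preceq_K f$, where $K$ is a proper 
cone. [18] shows that $\tilde{\mathcal{C}}_s$, which involves an auxiliary random vector $u_s$, can be expressed as 

$$\tilde{\mathcal{C}}_s = \Big\{ \mu_s \;:\; \mathbb{E}_{\mu_s}[u_s] = f,\quad \mu_s\big(G p_s \preceq_K u_s\big) = 1 \Big\}.$$

Note that $\mu_s(p_s) \in \Pi_{p_s} \tilde{\mathcal{C}}_s$, where $\Pi_{p_s} \tilde{\mathcal{C}}_s$ denotes the set of marginal distributions of $p_s$ under the joint 
probability distributions of the two random vectors $p_s$ and $u_s$. The problem in Theorem 2 now takes the form 

$$\begin{aligned}
\text{minimize}\quad & w \\
\text{subject to}\quad & \kappa + f^\top \nu \le w, \\
& \kappa + r_s^\top \pi_s \ge 0, \\
& V_s \pi_s + G^\top \nu = 0, \\
& \pi_s \in \Delta(s),\ \nu \in K^*.
\end{aligned}$$

The same argument can be developed for the mean of $r_s$. This example can also be treated via “classical” 
robust optimization models by virtue of Lemma 1. 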

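Example 2 (Variance): This example imposes conditions on both the mean and the covariance of the distribution. 
First, we assume that the mean of the random reward $r_s \sim \mu_s(r_s)$ is given by $\mathbb{E}_{\mu_s}[r_s] = m$, 
and that $\mathbb{E}_{\mu_s}[(r_s - m)(r_s - m)^\top] \preceq \Sigma$ for a positive semidefinite matrix $\Sigma$. We denote the ambiguity set under fixed mean $m$ 
as $\mathcal{C}_s(m)$. As discussed in [18], $\tilde{\mathcal{C}}_s(m)$, which involves a random matrix $U_s$, can be expressed 
as 

$$\tilde{\mathcal{C}}_s(m) = \Big\{ \mu_s \;:\; \mathbb{E}_{\mu_s}[r_s] = m,\quad \mathbb{E}_{\mu_s}[U_s] = \Sigma,\quad \mu_s\Big( \begin{bmatrix} 1 & (r_s - m)^\top \\ (r_s - m) & U_s \end{bmatrix} \succeq 0 \Big) = 1 \Big\}.$$

We have $\mu_s(r_s) \in \Pi_{r_s} \tilde{\mathcal{C}}_s(m)$, where $\Pi_{r_s} \tilde{\mathcal{C}}_s(m)$ denotes the set of marginal distributions of $r_s$ under the joint 
probability distributions of the random variables $r_s$ and $U_s$. Next, we consider the case where, given a matrix $G$ and 
a vector $f$ of compatible dimensions, the mean of the random reward is restricted by $G m \le f$. In this case, the ambiguity 
set can be written as $\mathcal{C}_s = \bigcup_{G m \le f} \mathcal{C}_s(m)$. By Theorem 2, the S-robust action can be computed using the 
following formulation: 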

minimize 
subject to 


w 

K + tr(SF 22 ) “ A u) 

n + plVgTTg - Yu - Fy > 0 

TT, - Yu - Y 21 -G^p = 0 

Yu + F 21 + GY = 0 

Vii FiT 

F 21 ^22 


F- 


= 0 


TTg G A(s), T A 0, F G S. 


|x4s 1+1 
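The same argument can be developed for the variance of $p_s$. 

Example 3 (Mean Absolute Deviation): Assume that $\mathbb{E}_{\mu_s}[|r_s - m|] \le f$ for $m, f \in \mathbb{R}^{|A_s|}$, where the absolute value is taken componentwise. [18] 
shows that $\tilde{\mathcal{C}}_s$, which involves the auxiliary random vector $u_s \in \mathbb{R}^{|A_s|}$, can be expressed as 

$$\tilde{\mathcal{C}}_s = \Big\{ \mu_s \;:\; \mathbb{E}_{\mu_s}[u_s] = f,\quad \mu_s\big( u_s \ge r_s - m,\ u_s \ge m - r_s \big) = 1 \Big\}.$$

Note that $\mu_s(r_s) \in \Pi_{r_s} \tilde{\mathcal{C}}_s$, where $\Pi_{r_s} \tilde{\mathcal{C}}_s$ denotes the set of marginal distributions of $r_s$ under the joint 
probability distributions of the random variables $r_s$ and $u_s$. The problem in Theorem 2 in this case can 
be represented as 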

minimize w 


subject to K — Up < w 

+ pIKtTs + m‘'’7rs > 0 
TTs G A(s), p > 0. 


The corresponding S-robust problem can be obtained for the mean absolute deviation of $p_s$. 

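Example 4 (Expected Huber Loss Function): For a scalar $z \in \mathbb{R}$, the Huber loss function is defined as 

$$H_\delta(z) = \begin{cases} \tfrac{1}{2} z^2 & \text{if } |z| \le \delta, \\ \delta\big(|z| - \tfrac{1}{2}\delta\big) & \text{otherwise,} \end{cases}$$

where $\delta > 0$ is a prescribed robustness parameter. 

Assume that $\mathbb{E}_{r_s \sim \mu_s(r_s)}[H_\delta(f^\top r_s)] \le g$ for a vector $f \in \mathbb{R}^{|A_s|}$ and a scalar $g$. [18] shows that $\tilde{\mathcal{C}}_s$, which involves 
the auxiliary random variables $u_s, v_s, w_s, s_s, t_s \in \mathbb{R}_+$, can be expressed as 

$$\tilde{\mathcal{C}}_s = \Big\{ \mu_s \;:\; \mathbb{E}_{\mu_s}[w_s] = g,\quad \mu_s\Big( \delta(u_s - s_s) + \tfrac{s_s^2}{2} + \delta(v_s - t_s) + \tfrac{t_s^2}{2} \le w_s,\ u_s \ge s_s,\ v_s \ge t_s,\ f^\top r_s = u_s - v_s \Big) = 1 \Big\}.$$

Note that $\mu_s(r_s) \in \Pi_{r_s} \tilde{\mathcal{C}}_s$, where $\Pi_{r_s} \tilde{\mathcal{C}}_s$ denotes the set of marginal distributions of $r_s$ under the joint 
probability distributions of the six random variables. The S-robust problem in Theorem 2 is 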


minimize w 

K;,7rs,/c,z/,7,?7,^ 


subject to K — gp < w 

K + pjtyvTs + 3S'y + 3Srj — 3S^p — 
Wg + Of = 0 


87 ^ Zrf 
2p 2p 


> 0 


7 — (5z/ — 6 * = 0 
g — 8 p -\- 0 = 3 
TTs G A(s), p,g,g>0. 


The corresponding S-robust problem for the case of transition probability can be obtained in a similar 
way. 
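
The Huber loss from Example 4 and the moment condition it induces can be evaluated directly on data. The sketch below is illustrative only: the sample-average check, the sample values and the names `huber_loss` and `expected_huber_loss` are assumptions, not part of the paper.

```python
def huber_loss(z, delta):
    """Huber loss: quadratic near zero, linear in the tails."""
    if abs(z) <= delta:
        return 0.5 * z * z
    return delta * (abs(z) - 0.5 * delta)


def expected_huber_loss(samples, f, delta):
    """Sample average of H_delta(f^T r) over reward vectors r."""
    total = 0.0
    for r in samples:
        z = sum(fi * ri for fi, ri in zip(f, r))  # inner product f^T r
        total += huber_loss(z, delta)
    return total / len(samples)


f = [1.0, -1.0]
reward_samples = [[1.0, 0.5], [4.0, 1.0], [0.0, 0.0]]
average_loss = expected_huber_loss(reward_samples, f, delta=1.0)
# A hypothetical bound g; the moment condition requires average_loss <= g.
g = 1.0
condition_holds = average_loss <= g
```

The two branches agree at $|z| = \delta$, so the loss is continuous, and large deviations are penalized only linearly, which is what makes the constraint less sensitive to outliers than a variance bound.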

Before concluding this section, we briefly discuss the computational complexity. We denote by $M$ the 
maximal computational effort needed to calculate a possibly randomized S-robust action $\pi_s$ for each $s \in S_t$. Thus, 
the total time complexity is $O(T|S|M)$. 

Remark 1 (Stationary & non-stationary model): The S-robust value we derived in this section is for the 
non-stationary model. It provides a lower bound for the stationary model, which is generally hard to solve. 
Thus, one can use the non-stationary model to approximate the stationary model when the latter is 
intractable. 
IV. Infinite horizon distributionally robust MDPs 

In this section, we study the distributionally ambiguous MDP in the discounted-reward infinite horizon 
setup. Specifically, we generalize the notion of S-robust strategy proposed in Section III to the discounted-reward 
infinite horizon case, and show that the S-robust strategy is distributionally robust. Similarly to [1], we 
consider two models, namely, the non-stationary model and the stationary model. The non-stationary model 
treats the system as having infinitely many states, each visited at most once. Therefore, we consider an 
equivalent MDP with an augmented state space, where each augmented state is defined by a pair $(s, t)$ 
with $s \in S$ and $t = 1, 2, \dots$, meaning state $s$ at time $t$. We define the set of distributions as the Cartesian 
product of the admissible distributions of each (augmented) state. That is, 

$$\bar{\mathcal{C}}_S^\infty \triangleq \Big\{ \mu \;\Big|\; \mu = \bigotimes_{s \in S,\ t = 1, 2, \dots} \mu_{s,t},\quad \mu_{s,t} \in \mathcal{C}_s,\ \forall s \in S,\ \forall t = 1, 2, \dots \Big\}.$$

The stationary model treats the system as having a finite number of states, while multiple visits to one 
state are allowed. That is, if a state $s$ is visited multiple times, then each time the distribution (of the 
uncertain parameters) $\mu_s$ is the same. We define the set of admissible distributions as 

$$\mathcal{C}_S^\infty \triangleq \Big\{ \mu \;\Big|\; \mu = \bigotimes_{s \in S,\ t = 1, 2, \dots} \mu_{s,t},\quad \mu_{s,t} = \mu_s,\ \mu_s \in \mathcal{C}_s,\ \forall s \in S,\ \forall t = 1, 2, \dots \Big\}.$$

These two formulations can model different setups: if the system, more specifically the distribution of the 
uncertain parameters, evolves with time, then the non-stationary model is more appropriate; if the system 
is static, then the stationary model is preferable. 

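The S-robust strategy for infinite horizon distributionally robust MDPs is defined as follows. 

Definition 3: Given a DAMDP $(T, \gamma, S, A, \mathcal{C}_S)$ with $T = \infty$, we define the S-robust problem as 
follows. 

1) The S-robust value $v_\infty(s)$ and the S-robust action $\pi_s^*$ are defined as 

$$v_\infty(s) \triangleq \max_{\pi_s \in \Delta(s)} \Big\{ \min_{\mu_s \in \mathcal{C}_s} \mathbb{E}_{(p_s, r_s) \sim \mu_s}\big\{ \mathbb{E}^{\pi_s}_{p_s}[r(s, a) + \gamma v_\infty(\underline{s})] \big\} \Big\},$$

$$\pi_s^* \in \arg\max_{\pi_s \in \Delta(s)} \Big\{ \min_{\mu_s \in \mathcal{C}_s} \mathbb{E}_{(p_s, r_s) \sim \mu_s}\big\{ \mathbb{E}^{\pi_s}_{p_s}[r(s, a) + \gamma v_\infty(\underline{s})] \big\} \Big\}.$$

2) A strategy $\pi^*$ is an S-robust strategy if, for all $s \in S$, $\pi_s^*$ is an S-robust action. 

To see that the S-robust strategy is well defined, it suffices to show that the following operator $\mathcal{L} : \mathbb{R}^{|S|} \to \mathbb{R}^{|S|}$ 
is a $\gamma$-contraction with respect to the $\ell_\infty$ norm: 

$$\{\mathcal{L} \mathbf{v}\}(s) = \max_{\pi_s \in \Delta(s)} \min_{\mu_s \in \mathcal{C}_s} \{\mathcal{L}^{\pi_s}_{\mu_s} \mathbf{v}\}(s),$$

where, given any $\pi_s \in \Delta(s)$ and $\mu_s \in \mathcal{C}_s$, 

$$\{\mathcal{L}^{\pi_s}_{\mu_s} \mathbf{v}\}(s) \triangleq \mathbb{E}_{(p_s, r_s) \sim \mu_s}\big\{ \mathbb{E}^{\pi_s}_{p_s}[r(s, a) + \gamma v(\underline{s})] \big\} = \sum_{a \in A_s} \pi_s(a) \bar{r}(s, a) + \gamma \sum_{a \in A_s} \sum_{s' \in S} \pi_s(a) \bar{p}(s'|s, a) v(s'),$$

with $\bar{r}$ and $\bar{p}$ denoting the expected values of $r$ and $p$ under $\mu_s$. 

Lemma 3: Under Assumption for $0 < \gamma < 1$, $\mathcal{L}$ is a $\gamma$-contraction with respect to the $\ell_\infty$ norm. 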

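Proof: Take arbitrary $\mathbf{v}_1$, $\mathbf{v}_2$, and fix $s \in S$. Let $\pi_s^1$, $\mu_s^1(p_s^1, r_s^1)$ and $\pi_s^2$, $\mu_s^2(p_s^2, r_s^2)$ be the respective 
maximizing and minimizing variables. Assume $\{\mathcal{L} \mathbf{v}_1\}(s) \ge \{\mathcal{L} \mathbf{v}_2\}(s)$. We have 

$$0 \le \{\mathcal{L} \mathbf{v}_1\}(s) - \{\mathcal{L} \mathbf{v}_2\}(s) = \{\mathcal{L}^{\pi_s^1}_{\mu_s^1} \mathbf{v}_1\}(s) - \{\mathcal{L}^{\pi_s^2}_{\mu_s^2} \mathbf{v}_2\}(s) \le \{\mathcal{L}^{\pi_s^1}_{\mu_s^2} \mathbf{v}_1\}(s) - \{\mathcal{L}^{\pi_s^1}_{\mu_s^2} \mathbf{v}_2\}(s).$$

The right-hand side term can be expressed as 

$$\{\mathcal{L}^{\pi_s^1}_{\mu_s^2} \mathbf{v}_1\}(s) - \{\mathcal{L}^{\pi_s^1}_{\mu_s^2} \mathbf{v}_2\}(s) = \gamma \sum_{a \in A_s} \sum_{s' \in S} \pi_s^1(a) \bar{p}^2(s'|s, a) [v_1(s') - v_2(s')] \le \gamma \sum_{a \in A_s} \sum_{s' \in S} \pi_s^1(a) \bar{p}^2(s'|s, a) \|\mathbf{v}_1 - \mathbf{v}_2\|_\infty = \gamma \|\mathbf{v}_1 - \mathbf{v}_2\|_\infty,$$

where $\bar{p}^2$ denotes the expected value of $p$ under $\mu_s^2$. The last equality holds because $\pi_s^1$ is a vector on the probability simplex and $\sum_{s' \in S} \bar{p}^2(s'|s, a) = 1$. 
Thus, for any $\mathbf{v}_1$, $\mathbf{v}_2$, given any $\pi_s \in \Delta(s)$ and $\mu_s \in \mathcal{C}_s$, $\mathcal{L}^{\pi_s}_{\mu_s}$ is a $\gamma$-contraction. 

The above derivation establishes that when $\{\mathcal{L} \mathbf{v}_1\}(s) \ge \{\mathcal{L} \mathbf{v}_2\}(s)$ we have $0 \le \{\mathcal{L} \mathbf{v}_1\}(s) - \{\mathcal{L} \mathbf{v}_2\}(s) \le 
\gamma \|\mathbf{v}_1 - \mathbf{v}_2\|_\infty$. Repeating this argument in the case that $\{\mathcal{L} \mathbf{v}_1\}(s) < \{\mathcal{L} \mathbf{v}_2\}(s)$ implies that 

$$|\{\mathcal{L} \mathbf{v}_1\}(s) - \{\mathcal{L} \mathbf{v}_2\}(s)| \le \gamma \|\mathbf{v}_1 - \mathbf{v}_2\|_\infty$$

for all $s \in S$. Taking the supremum over $s$ in the above expression yields the result. ■ 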

Note that for given any v and each s, by applying Theorem S-robust strategy can be obtained 
efficiently. The Banach Fixed-Point Theorem states that there exists a unique $v^*$ such that $\mathcal{L}v^* = v^*$, which 
is the S-robust value by definition. Moreover, for arbitrary $v^0$, the value vector sequence defined by 
$v^n = \mathcal{L}^n v^0$ converges exponentially to $v^*$. Therefore, as the following lemma shows, we can compute 
the S-robust action for each $s$ (and hence the S-robust strategy) using this procedure. 

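Lemma 4: Given $s \in S$, let $v^n = \mathcal{L}^n v^0$ and 

$$\pi_s^n \in \arg\max_{\pi_s \in \Delta(s)} \Big\{ \min_{\mu_s \in C_s} \{\mathcal{L}^{\pi_s}_{\mu_s} v^n\}(s) \Big\}.$$

Then the sequence $\{\pi_s^n\}$ has convergent subsequences, and any of its limiting points is an S-robust 
action of state $s$. 

Proof: Since $\Delta(s)$ is compact, the sequence $\{\pi_s^n\}$ has convergent subsequences. To show that any 
limiting point is an S-robust action, without loss of generality, we assume $\pi_s^n \to \pi_s^*$. 

For $v$, $v'$, and given any $\mu_s \in C_s$, $\pi_s \in \Delta(s)$, we note that $\mathcal{L}^{\pi_s}_{\mu_s}$ is a $\gamma$-contraction (see the proof of 
Lemma 3), that is, 

$$\{\mathcal{L}^{\pi_s}_{\mu_s} v\}(s) \le \{\mathcal{L}^{\pi_s}_{\mu_s} v'\}(s) + \gamma \|v - v'\|_\infty.$$

By definition of $\pi_s^n$, for any $\pi_s \in \Delta(s)$, we have 

$$\min_{\mu_s \in C_s} \{\mathcal{L}^{\pi_s}_{\mu_s} v^n\}(s) \le \min_{\mu_s \in C_s} \{\mathcal{L}^{\pi_s^n}_{\mu_s} v^n\}(s).$$

Therefore, for $v^*$ defined by $\mathcal{L}v^* = v^*$, we have 

$$\begin{aligned}
& \min_{\mu_s \in C_s} \{\mathcal{L}^{\pi_s^n}_{\mu_s} v^*\}(s) - \min_{\mu_s \in C_s} \{\mathcal{L}^{\pi_s}_{\mu_s} v^*\}(s) \\
={}& \Big\{ \min_{\mu_s \in C_s} \{\mathcal{L}^{\pi_s^n}_{\mu_s} v^*\}(s) - \min_{\mu_s \in C_s} \{\mathcal{L}^{\pi_s^n}_{\mu_s} v^n\}(s) \Big\} \\
& + \Big\{ \min_{\mu_s \in C_s} \{\mathcal{L}^{\pi_s^n}_{\mu_s} v^n\}(s) - \min_{\mu_s \in C_s} \{\mathcal{L}^{\pi_s}_{\mu_s} v^n\}(s) \Big\} \\
& + \Big\{ \min_{\mu_s \in C_s} \{\mathcal{L}^{\pi_s}_{\mu_s} v^n\}(s) - \min_{\mu_s \in C_s} \{\mathcal{L}^{\pi_s}_{\mu_s} v^*\}(s) \Big\} \\
\ge{}& -2\gamma \|v^n - v^*\|_\infty.
\end{aligned} \tag{4}$$

By Lemma, we denote $\mu_s^* = \arg\min_{\mu_s \in C_s} \{\mathcal{L}^{\pi_s^*}_{\mu_s} v^*\}(s)$. We have 

$$\lim_{n} \min_{\mu_s \in C_s} \{\mathcal{L}^{\pi_s^n}_{\mu_s} v^*\}(s) \le \lim_{n} \{\mathcal{L}^{\pi_s^n}_{\mu_s^*} v^*\}(s) = \{\mathcal{L}^{\pi_s^*}_{\mu_s^*} v^*\}(s) = \min_{\mu_s \in C_s} \{\mathcal{L}^{\pi_s^*}_{\mu_s} v^*\}(s). \tag{5}$$

Here the first equality holds since $\{\mathcal{L}^{\pi_s}_{\mu_s^*} v\}(s)$ is continuous in $\pi_s$ and $\pi_s^n \to \pi_s^*$, and the second equality 
holds due to the definition of $\mu_s^*$. Combining (4) and (5), and noting that $v^n \to v^*$ due to Lemma 3, we obtain 

$$\min_{\mu_s \in C_s} \{\mathcal{L}^{\pi_s^*}_{\mu_s} v^*\}(s) \ge \min_{\mu_s \in C_s} \{\mathcal{L}^{\pi_s}_{\mu_s} v^*\}(s).$$

Hence, $\pi_s^*$ is an S-robust action of state $s$. ■ 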
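
The fixed-point iteration and the extraction of an S-robust action can be illustrated with a minimal sketch. It rests on simplifying assumptions that are not part of the paper: each ambiguity set is replaced by a finite list of candidate expected models $(r, p)$, every state has exactly two actions, and all numbers in the toy instance are hypothetical. The function names `backup`, `apply_operator` and `value_iteration` are ours. The paper solves the S-robust problems with CVX; here the outer maximization is done by checking the breakpoints of a concave piecewise linear function.

```python
from itertools import combinations


def backup(v, models, gamma):
    """Worst-case backup at one state with two actions.

    models: finite list of (r, p) pairs standing in for the ambiguity set,
    where r[a] is an expected reward and p[a][s2] an expected transition
    probability. Returns (value, q), q being the probability of action 0.
    """
    # Value of each pure action under each candidate model.
    lines = [
        tuple(r[a] + gamma * sum(p[a][s2] * v[s2] for s2 in range(len(v)))
              for a in (0, 1))
        for r, p in models
    ]

    def worst_case(q):
        return min(q * x0 + (1.0 - q) * x1 for x0, x1 in lines)

    # worst_case is concave and piecewise linear in q, so its maximum over
    # [0, 1] sits at an endpoint or where two model lines intersect.
    candidates = [0.0, 1.0]
    for (x0, x1), (y0, y1) in combinations(lines, 2):
        slope_gap = (x0 - x1) - (y0 - y1)
        if abs(slope_gap) > 1e-12:
            q = (y1 - x1) / slope_gap
            if 0.0 < q < 1.0:
                candidates.append(q)
    best_q = max(candidates, key=worst_case)
    return worst_case(best_q), best_q


def apply_operator(v, ambiguity, gamma):
    """Apply the finite-model operator to every state."""
    return [backup(v, ambiguity[s], gamma) for s in range(len(v))]


def value_iteration(ambiguity, gamma, tol=1e-10, max_iter=10_000):
    """Iterate v <- L v from v = 0 until the sup-norm change is below tol."""
    v = [0.0] * len(ambiguity)
    for _ in range(max_iter):
        new_v = [value for value, _ in apply_operator(v, ambiguity, gamma)]
        gap = max(abs(a - b) for a, b in zip(new_v, v))
        v = new_v
        if gap < tol:
            break
    actions = [q for _, q in apply_operator(v, ambiguity, gamma)]
    return v, actions


# Toy instance (all numbers hypothetical): 2 states, 2 actions, 2 models each.
AMBIGUITY = [
    [([1.0, 0.0], [[0.9, 0.1], [0.2, 0.8]]),
     ([0.5, 0.4], [[0.6, 0.4], [0.5, 0.5]])],
    [([0.0, 2.0], [[1.0, 0.0], [0.3, 0.7]]),
     ([0.3, 1.0], [[0.8, 0.2], [0.1, 0.9]])],
]
GAMMA = 0.8
v_star, actions = value_iteration(AMBIGUITY, GAMMA)
print("S-robust values:", v_star)
print("P(action 0) per state:", actions)
```

Because the operator is a contraction, the loop stops after a number of sweeps that grows only logarithmically in the tolerance; the mixed action returned for each state plays the role of the limiting point in Lemma 4 for this simplified setting.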

Based on these lemmas, we have the following two theorems which show that the S-robust strategy is 
distributionally robust. Indeed, these two theorems are similar to Theorems 4.1 and 4.2 of [1]. We omit 
the proofs as they are identical to those of [1]. 

Theorem 3: Under Assumption, given $T = \infty$ and $0 < \gamma < 1$, any S-robust strategy is distributionally 
robust with respect to $\bar{C}_S^\infty$. 

Theorem 4: Under Assumption, given $T = \infty$ and $0 < \gamma < 1$, any S-robust strategy is distributionally 
robust with respect to $\bar{C}_S$. 

In terms of computation, to achieve an accuracy of $\epsilon$, the computational complexity of the infinite horizon 
case should be $O(r|S|M \log\epsilon / \log\gamma)$. 
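
The $\log\epsilon/\log\gamma$ factor is the number of applications of the contraction needed to shrink the error below $\epsilon$. A small sketch of that count follows; the function name and the `initial_error` parameter (the sup-norm distance of the starting vector from the fixed point, normalized to 1 by default) are assumptions made for illustration.

```python
import math


def iterations_for_accuracy(epsilon, gamma, initial_error=1.0):
    """Smallest n such that gamma**n * initial_error <= epsilon."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    if epsilon <= 0.0:
        raise ValueError("epsilon must be positive")
    if initial_error <= epsilon:
        return 0
    return math.ceil(math.log(epsilon / initial_error) / math.log(gamma))


# With the discount factor 0.8 used in the experiments and a 1e-6 target.
n_sweeps = iterations_for_accuracy(1e-6, 0.8)
print("sweeps needed:", n_sweeps)
```

Each sweep solves one S-robust problem per state, which is where the remaining factors of the complexity bound come from.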

Remark 2 (Stationary & non-stationary model): For any given strategy, Theorems 3 and 4 imply that, 
when $T \to \infty$, the distributionally robust strategies for both formulations coincide, and can be computed 
by iteratively solving the S-robust problem defined in Definition 3. 

V. Simulation 

In this section, we study two synthetic numerical examples: a machine replacement problem and a 
path planning problem. In the machine replacement problem, the reward is assumed to be uncertain; 
whereas in the path planning problem, the transition probability is uncertain. Our numerical simulation 
shows that by incorporating more general probabilistic information, the proposed distributionally robust 
approach handles uncertainty in a more flexible way, and hence leads to a better performance than the 
nominal, robust and distributionally robust approach proposed in Q. All results were generated on an 
Intel Core i5-3570 CPU with 3.40 GHz clock speed and 8 GB RAM. The S-robust problems in robust 
and distributionally robust approaches are solved in Matlab using the CVX package p^. 


A. Reward uncertainty in the machine replacement problem 

We consider a machine replacement problem similar to that considered in p2] |. Consider the repair 
cost incurred by a factory that holds a large number of machines, given that each of these machines is 
modeled with the same underlying MDP, for which rewards are subject to uncertainty. Note that we select 
two simple instances of the machine replacement problem to better illustrate how our method compares with 
the nominal and the (distributionally) robust approaches. In fact, our method remains computationally 
tractable for much larger scale machine replacement problems with more than 1,000 states. 

1) Machine replacement as a MDP with Gaussian rewards: In our first experiment with Gaussian 
rewards MDP, we consider a machine replacement problem with 50 states, 2 actions (“repair” and “not 
repair”) for each state, deterministic transitions, a discount factor of 0.8, and Gaussian distribution of the 
uncertain rewards (see Fig. 2). For each of the first 48 states, the “repair” action has a cost independently 
distributed as $\mathcal{N}(130,1)$. The 49th and 50th states of the machine’s life are designed to be risky: not 
repairing at state 50 incurs a highly uncertain cost $\mathcal{N}(100,800)$, while repairing at both states is a more 
secure but still uncertain option with a cost $\mathcal{N}(130,10)$. The detailed implementation is as follows: By 
observing sample data generated from the true Gaussian distribution, we compute the estimated 
mean $m$ and variance $\sigma^2$. We use the mean value of the uncertain rewards to compute the nominal strategy. 
For both the robust and the distributionally robust strategies, we construct confidence sets using $m \pm 3\sigma$ for the 
first 49 states, and $m \pm 4\sigma$ for state 50, as it is more risky and thus harder to estimate. In addition, we 
construct an extra 62.32% confidence interval $O^1_{50}$ (centered at the mean) with 60%-70% confidence 
level (i.e., $\underline{\alpha}^1_{50} = 0.6$ and $\overline{\alpha}^1_{50} = 0.7$) for the distributionally robust strategy. The optimal paths followed by the 
three strategies are shown in Fig. 2. 
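
The construction of the confidence sets from sample data can be sketched as follows. This is an illustration under our own assumptions, not the authors' Matlab code: the second argument of $\mathcal{N}(\cdot,\cdot)$ is read as a variance, the sample size and seed are arbitrary, and the helper names `sigma_interval` and `central_interval` are hypothetical.

```python
import random
from statistics import NormalDist, fmean, stdev


def sigma_interval(samples, k):
    """Interval m +/- k*sigma built from the sample mean and std."""
    m, s = fmean(samples), stdev(samples)
    return m - k * s, m + k * s


def central_interval(samples, coverage):
    """Interval centered at the sample mean with the given Gaussian coverage."""
    m, s = fmean(samples), stdev(samples)
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)
    return m - z * s, m + z * s


rng = random.Random(0)
# Assumption: N(100, 800) is read as mean 100 and variance 800.
costs = [rng.gauss(100.0, 800.0 ** 0.5) for _ in range(5000)]

outer = sigma_interval(costs, 4)         # m +/- 4 sigma, as used for state 50
inner = central_interval(costs, 0.6232)  # the extra 62.32% interval
inside = sum(inner[0] <= c <= inner[1] for c in costs) / len(costs)
print("outer set:", outer)
print("inner set:", inner, "empirical coverage:", inside)
```

The empirical coverage of the inner interval should fall close to its nominal level, which is why lower and upper confidence bounds such as 0.6 and 0.7 bracket it.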
Fig. 2. Instance of a machine replacement problem with Gaussian uncertainty in the rewards. (Legend: nominal strategy, robust strategy, distributionally robust strategy.) 


TABLE I 

Average total discounted rewards and computational times of nominal, robust, and distributionally robust 
strategies in machine replacement problem with Gaussian rewards 

Strategies                          Nominal            Robust             Distributionally robust 
Average total discounted rewards    -1.8 x 10^(...)    -2.9 x 10^(...)    -2.3 x 10^(...) 
Computational times (seconds)       0.643              815                820 


The performance of the strategies obtained by using the nominal, the robust and the distributionally 
robust approaches is presented in Fig. 3. The corresponding average total discounted rewards and computational 
times are shown in Table I. The nominal strategy results in the highest average total discounted 
rewards. This is well expected as we are using the exact mean value of the reward as the nominal 
parameter. However, the nominal strategy is highly risky: it cannot prevent bad performance (e.g., 
-2.5 x 10^(...)) from happening, which is undesirable. These results coincide with what one would typically 
expect from the three solution concepts. While the nominal strategy, blind to any form of risk, finds no 
advantage in ever repairing, the robust strategy ends up following a highly conservative policy (repairing 
the machine at state 49 to avoid state 50). On the other hand, the distributionally robust optimal strategy 
makes use of more distributional information and handles the risk efficiently by waiting until state 50 and 
then repairing the machine. Therefore, this strategy beats the nominal and robust strategies in that it strikes 
a good tradeoff between high mean reward and low variance over 10,000 different trials. 



Fig. 3. Performance comparisons between nominal, robust, and distributionally robust strategies on 10,000 runs of the machine replacement 
problem with Gaussian rewards (the right figure focuses on the interval [-0.0045, -0.001]). 


2) Machine replacement as a MDP with mixed Gaussian rewards: The second experiment considers a 
setup similar to the previous one, except that not repairing at the 50th state now has a reward which follows 
a mixed Gaussian distribution (see Fig. 4). This experiment illustrates the effect of the two different nested-set 
structures shown in Fig. 1. Specifically, we apply the two different distributionally robust approaches 
(proposed in this paper and in [1], respectively) to this problem, and show that our method, which is more 
general, achieves better performance. The detailed implementation is as follows: For the Gaussian mixture 
model, the estimated mean and variance for each Gaussian cannot be computed by simply taking the mean 
Fig. 4. Instance of a machine replacement problem with mixed Gaussian uncertainty in the rewards. (Legend: robust strategy, distributionally robust strategy 1, distributionally robust strategy 2; mixture label in the figure: $0.9\,\mathcal{N}(100,10) \oplus 0.1\,\mathcal{N}(140,2)$.) 

Fig. 5. Illustration of the confidence sets for two distributionally robust strategies. 

and variance of the entire sample data set. The observed data must be divided into several Gaussians, 
each with its own mean and variance. We obtain the parameters for each Gaussian by applying the 
expectation-maximization algorithm p3] |. Note that the algorithm requires an initial guess as to how many 
Gaussians are hidden in the distribution. This can be done by checking the histogram of the observed 
sample data. We choose it to be 2 in this experiment. After we get the estimated means and variances, 
the estimated probability density function can be obtained. For the robust and the two distributionally robust 
strategies, we construct an uncertainty set corresponding to the 99% probability support of the rewards for the 
first 49 states, and 99.9% for the 50th state, which is more risky, using the estimated probability density 
function. For the first distributionally robust strategy (the one proposed in [1]), we construct two additional 
nested confidence sets of the uncertain rewards (see Fig. 5a): a 40.96% confidence set $O^1_{50}$ with 40%-50% 
confidence level (lower/upper confidence bounds $\underline{\alpha}^1_{50} = 0.4$, $\overline{\alpha}^1_{50} = 0.5$), and a 65.61% confidence set $O^2_{50}$ with 
60%-70% confidence level (lower/upper confidence bounds $\underline{\alpha}^2_{50} = 0.6$, $\overline{\alpha}^2_{50} = 0.7$). In contrast, for 
the second distributionally robust strategy (the one proposed in this paper), we construct two disjoint 
confidence sets of the uncertain rewards (see Fig. 5b): a 77.97% confidence set $O^1_{50}$ with 70%-80% 
confidence level (lower/upper confidence bounds $\underline{\alpha}^1_{50} = 0.7$, $\overline{\alpha}^1_{50} = 0.8$), and a 9.90% confidence set $O^2_{50}$ 
with 0%-10% confidence level (lower/upper confidence bounds $\underline{\alpha}^2_{50} = 0$, $\overline{\alpha}^2_{50} = 0.1$). Specifically, we 
select these two intervals around the peaks of the two Gaussian components (i.e., $\mathcal{N}(100,10)$ and $\mathcal{N}(140,2)$) 
to better model this mixed distribution. The optimal paths followed by the three strategies are shown in Fig. 4. 
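
The expectation-maximization step used to recover the two Gaussians can be sketched in a few lines. This is a generic textbook EM for a one-dimensional two-component mixture, not the authors' implementation; the function names, the initialization rule, the sample size and the seed are our assumptions, and the second argument of $\mathcal{N}(\cdot,\cdot)$ is read as a variance.

```python
import math
import random


def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_two_gaussians(data, n_iter=100):
    """Fit weights, means and variances of a 2-component 1-D Gaussian mixture."""
    n = len(data)
    lo, hi = min(data), max(data)
    grand_mean = sum(data) / n
    grand_var = sum((x - grand_mean) ** 2 for x in data) / n
    # Crude initial guess: one component in each half of the data range.
    weights = [0.5, 0.5]
    means = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]
    variances = [grand_var, grand_var]
    for _ in range(n_iter):
        # E-step: responsibility of component 0 for every sample.
        resp0 = []
        for x in data:
            d0 = weights[0] * normal_pdf(x, means[0], variances[0])
            d1 = weights[1] * normal_pdf(x, means[1], variances[1])
            total = d0 + d1
            resp0.append(d0 / total if total > 0.0 else 0.5)
        # M-step: re-estimate each component from its weighted samples.
        for k in (0, 1):
            resp = resp0 if k == 0 else [1.0 - r for r in resp0]
            mass = max(sum(resp), 1e-12)
            weights[k] = mass / n
            means[k] = sum(r * x for r, x in zip(resp, data)) / mass
            var = sum(r * (x - means[k]) ** 2 for r, x in zip(resp, data)) / mass
            variances[k] = max(var, 1e-6)  # guard against a collapsing component
    return weights, means, variances


rng = random.Random(1)
# Samples from 0.9 N(100, 10) + 0.1 N(140, 2), variances assumed.
data = [
    rng.gauss(100.0, 10.0 ** 0.5) if rng.random() < 0.9 else rng.gauss(140.0, 2.0 ** 0.5)
    for _ in range(2000)
]
weights, means, variances = em_two_gaussians(data)
print("weights:", weights)
print("means:", means)
print("variances:", variances)
```

With the fitted components in hand, an interval around each estimated mean can serve as one of the disjoint confidence sets, and the fitted weights suggest where to place the lower and upper confidence bounds.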


The performance of the strategies obtained by using the robust and the two distributionally robust approaches 
is presented in Fig. 6. The corresponding average total discounted rewards and computational times are 
shown in Table II. As expected, the robust strategy ends up following a highly conservative policy which 
repairs the machine at state 49 to avoid state 50. The first distributionally robust strategy, not modeling 
the mixture Gaussian distribution well, finds it advantageous to repair at the 50th state. In contrast, 
capable of capturing the distribution information in a more flexible way, the second distributionally robust 
strategy better models the uncertainty and finds that not repairing the machine at state 50 is optimal. The 
performance comparison clearly shows that the second distributionally robust strategy is more desirable, 
which highlights that the distributionally robust approach with general structure of confidence sets can be 
beneficial in practice. 
Fig. 6. Performance comparisons between robust and two distributionally robust strategies on 10,000 runs of the machine replacement 
problem with mixed Gaussian rewards. (Legend: robust, distributionally robust 1, distributionally robust 2; horizontal axis: discounted reward.) 



Fig. 7. The maze for the path planning problem 


We remark that, in practice, one can obtain the modality structure of uncertain parameters in a data- 
driven way by applying clustering algorithms to an initial primitive data set. For example, one may check 
the histogram of historical observations. If the data concentrate on several distinct and disjoint bins, our 
multi-modal DAMDP approach can be applied. 
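
One simple way to carry out the histogram check is sketched below. The helper `disjoint_clusters`, the bin width and the toy observations are all hypothetical; the idea is only to group runs of adjacent occupied bins and report the empirical probability mass of each run, which can then be turned into a confidence set with lower and upper confidence bounds around that mass.

```python
def disjoint_clusters(samples, bin_width, min_count=1):
    """Group adjacent occupied histogram bins into disjoint intervals.

    Returns a list of (left, right, mass) triples, where mass is the fraction
    of all samples that falls inside the interval.
    """
    lo = min(samples)
    counts = {}
    for x in samples:
        idx = int((x - lo) // bin_width)
        counts[idx] = counts.get(idx, 0) + 1
    occupied = sorted(i for i, c in counts.items() if c >= min_count)
    if not occupied:
        return []
    # Split the occupied bin indices wherever there is a gap.
    runs = []
    start = prev = occupied[0]
    for i in occupied[1:]:
        if i != prev + 1:
            runs.append((start, prev))
            start = i
        prev = i
    runs.append((start, prev))
    n = len(samples)
    return [
        (lo + first * bin_width,
         lo + (last + 1) * bin_width,
         sum(counts[i] for i in range(first, last + 1)) / n)
        for first, last in runs
    ]


# Hypothetical historical observations with two well-separated modes.
observations = [98.0, 99.0, 100.0, 100.0, 101.0, 102.0, 139.0, 140.0, 140.0, 141.0]
clusters = disjoint_clusters(observations, bin_width=2.0)
for left, right, mass in clusters:
    print(f"[{left}, {right}) holds {mass:.0%} of the data")
```

If more than one interval comes back, the data are a candidate for disjoint confidence sets rather than a single nested family.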


TABLE II 

Average total discounted rewards and computational times of robust and two distributionally robust strategies 
in machine replacement problem with mixed Gaussian rewards 

Strategies                          Robust             Distributionally robust 1    Distributionally robust 2 
Average total discounted rewards    -2.9 x 10^(...)    -2.3 x 10^(...)              -1.9 x 10^(...) 
Computational times (seconds)       849                862                          820 


B. Transition uncertainty in the path planning problem 

In this subsection, we consider a path planning problem, similar to the one presented in Q: an agent 
wants to exit a $4 \times 21$ maze (shown in Fig. 7) using the least possible time. Starting from the upper-left 
corner, the agent can move up, down, left and right, but can only exit the grid at the lower-right corner. 
Here, a white box stands for a normal place where the agent needs one time unit to pass through. A 
shaded box represents a “shaky” place: if an agent reaches a “shaky” place, then he may risk jumping 
to the starting point (“reboot”). The true transition probability of the jump follows a distribution 
$(1-\lambda)\,\mathcal{N}(0.1, 10^{(\ldots)}) \oplus \lambda\,\mathcal{N}(0.2, 10^{(\ldots)})$, where $\lambda \in (0,1]$. The four approaches are implemented as follows: 
The nominal approach neglects this random jump. The robust approach takes a worst-case analysis, i.e., 
it assumes that, with probability 30% (the whole probability support of the transition), the agent will jump to the spot 
with the highest cost-to-go. The first distributionally robust approach takes into account additional 
information by using two nested confidence sets: the event that the jump probability parameter belongs to the interval 9%-11% 
Fig. 8. Performance comparisons between nominal, robust and two distributionally robust strategies on 3,000 runs of the path planning 
problem. 


has confidence $1 - \lambda$. The second distributionally robust approach incorporates more information. 
Specifically, we construct an extra confidence interval, which is disjoint from the above 9%-11% interval, such that 
the chance of jumping with probability 20% is $\lambda$. 
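
A small worked comparison shows why the extra interval makes the second approach less conservative. The sketch assumes that a larger expected jump probability is always worse for the agent, so the adversarial distribution pushes mass to the upper end of every set it is allowed to use. The upper end 0.21 of the interval around 20% is a hypothetical value chosen for illustration, and the function name is ours.

```python
def worst_case_jump_probability(lam, tight_max=0.11, support_max=0.30, second_max=None):
    """Largest expected jump probability compatible with the confidence sets.

    lam:        the mixture weight (lambda in the text), between 0 and 1.
    tight_max:  upper end of the 9%-11% interval, which holds mass 1 - lam.
    support_max: upper end of the support, used when nothing else is known
                 about the remaining mass lam (nested-set approach).
    second_max: upper end of the extra disjoint interval (hypothetical); when
                given, the remaining mass lam cannot exceed it.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    tail = support_max if second_max is None else second_max
    return (1.0 - lam) * tight_max + lam * tail


for lam in (0.1, 0.5, 0.9):
    robust = 0.30
    nested = worst_case_jump_probability(lam)
    disjoint = worst_case_jump_probability(lam, second_max=0.21)
    print(f"lambda={lam}: robust={robust:.3f} nested={nested:.3f} disjoint={disjoint:.3f}")
```

Under these assumptions the adversary's best expected jump probability is never larger with the disjoint sets than with the nested sets, and both are bounded by the robust value of 30%.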

The performance of the strategies of the nominal, the robust and the two distributionally robust approaches 
is presented in Fig. 8, where the error bars show the standard error of the expected time to exit. The CPU 
times of computing optimal policies for the four strategies are 0.461, 549, 642 and 654 seconds, respectively. 
The second distributionally robust approach, i.e., the approach proposed in this paper, outperforms the 
other three approaches over virtually the whole spectrum of $\lambda$. This is well expected, since additional 
probabilistic information is available to and incorporated by the second distributionally robust approach, 
which considers ambiguity sets with more general structures. 

VI. Conclusion 

In this paper, we considered Markov decision problems with uncertainty. Specifically, we generalized 
the distributionally robust approach proposed in [1] to incorporate more general ambiguity sets [ [T^ to 
model a-priori probabilistic information of the uncertain parameters. We proposed a way to compute the 
distributionally robust strategy through a Bellman type backward induction. We showed that the strategy, 
which achieves maximum expected utility under the worst admissible distributions of uncertain parameters, 
can be solved in polynomial time under some mild technical conditions. We believe that many important 
problems that are usually addressed using standard MDP models could be revisited and better resolved 
using the proposed models when parameter uncertainty exists, as this formulation naturally enables the 
decision maker to account for more general parameter uncertainty. 